{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5254bc18-fa57-4865-b2d6-db1cb4f411bf",
   "metadata": {
    "id": "5254bc18-fa57-4865-b2d6-db1cb4f411bf"
   },
   "source": [
    "# Contextual Compression and Filtering in RAG"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c72fdaec-8061-47ac-9b48-dcb3ae1cdd4c",
   "metadata": {
    "id": "c72fdaec-8061-47ac-9b48-dcb3ae1cdd4c"
   },
   "source": [
    "### Installing dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "e8b87ed8-7b08-484a-a7a3-6ea54b0dce99",
   "metadata": {
    "id": "e8b87ed8-7b08-484a-a7a3-6ea54b0dce99"
   },
   "outputs": [],
   "source": [
    "!pip install -qU langchain langchain-community huggingface_hub lancedb pypdf python-dotenv transformers sentence-transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f931aad1-b0b7-4873-892e-586fff5b27ab",
   "metadata": {
    "id": "f931aad1-b0b7-4873-892e-586fff5b27ab"
   },
   "source": [
    "### Importing libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "96b1065d-f22c-45f3-8286-2d93a54086b8",
   "metadata": {
    "id": "96b1065d-f22c-45f3-8286-2d93a54086b8"
   },
   "outputs": [],
   "source": [
    "from langchain_community.llms import HuggingFaceHub\n",
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain_community.embeddings import SentenceTransformerEmbeddings\n",
    "from langchain.prompts import PromptTemplate\n",
    "from dotenv import load_dotenv\n",
    "from langchain_community.llms import OpenAI\n",
    "import lancedb\n",
    "from langchain.retrievers import ContextualCompressionRetriever\n",
    "from langchain.retrievers.document_compressors import LLMChainExtractor\n",
    "from getpass import getpass\n",
    "import os\n",
    "from langchain_community.embeddings import OpenAIEmbeddings\n",
    "from langchain.retrievers.document_compressors import EmbeddingsFilter\n",
    "from langchain.document_transformers import EmbeddingsRedundantFilter\n",
    "from langchain.retrievers.document_compressors import DocumentCompressorPipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "6a6c70ab-a66e-4913-a063-70f059f46699",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "6a6c70ab-a66e-4913-a063-70f059f46699",
    "outputId": "5f64c561-b910-4a1c-9444-a59026ea1c09"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter HuggingFace Hub Token:··········\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ[\"HUGGINGFACEHUB_API_TOKEN\"] = getpass(\"Enter HuggingFace Hub Token:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "807b9dbb-70c9-4dd0-b578-937c5dc497f7",
   "metadata": {
    "id": "807b9dbb-70c9-4dd0-b578-937c5dc497f7"
   },
   "source": [
    "### Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7MBmbHc-VjlS",
   "metadata": {
    "id": "7MBmbHc-VjlS"
   },
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/examples/Contextual-Compression-with-RAG/Science_Glossary.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "627249f2-3861-4ad9-8dc3-f0a53cc9336d",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "627249f2-3861-4ad9-8dc3-f0a53cc9336d",
    "outputId": "ed69d0ac-cd53-476f-b743-b181ef2c9ddf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7\n",
      "SCIENCE GLOSSARY \n",
      " \n",
      "Abiotic:  A nonliving factor or element (e.g., light, water, heat, rock, energy, mineral). \n",
      " \n",
      "Acid deposition: Precipitation with a pH less than 5.6 that forms in the atmosphere when certain pollutants mix \n",
      "with water vapor. \n",
      " \n",
      "Allele:  Any of a set of possible forms of a gene. \n",
      " \n",
      "Biochemical conversion:  The changing of organic matter into other chemical forms. \n",
      " \n",
      "Biological diversity: The variety and complexity of species present and interacting in an ecosystem and the relative \n",
      "abundance of each. \n",
      " \n",
      "Biomass conversion: The changing of organic matter that has been produced by photosynthesis into useful liquid, gas \n",
      "or fuel. \n",
      " \n",
      "Biomedical technology: The application of health care theories to develop methods, products and tools to maintain or \n",
      "improve homeostasis. \n",
      " \n",
      "Biomes:  A community of living organisms of a single major ecological region. \n",
      " \n",
      "Biotechnology:  The ways that humans apply biological concepts to produce products and provide services. \n",
      " \n",
      "Biotic:  An environmental factor related to or produced by living organisms. \n",
      " \n",
      "Carbon chemistry: The science of the composition, structure, properties and reactions of carbon based matter, \n",
      "especially of atomic and molecular systems; sometimes referred to as organic chemistry. \n",
      " \n",
      "Closing the loop: A link in the circular chain of recycling events that promotes the use of products made with \n",
      "recycled materials. \n",
      " \n",
      "Commodities: Economic goods or products before they are processed and/or given a brand name, such as a \n",
      "product of agriculture. \n",
      " 1\n"
     ]
    }
   ],
   "source": [
    "loader = PyPDFLoader(\"Science_Glossary.pdf\")\n",
    "documents = loader.load()\n",
    "print(len(documents))\n",
    "print(documents[0].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a13a069-8194-405b-b797-58aaec0ee3d6",
   "metadata": {
    "id": "7a13a069-8194-405b-b797-58aaec0ee3d6"
   },
   "source": [
    "### Split texts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "a0935ff9-ec5e-4af3-8194-694d6e09f14b",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "a0935ff9-ec5e-4af3-8194-694d6e09f14b",
    "outputId": "34d44c69-84f8-4716-f335-3f1d664f89d4"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "22\n",
      "page_content='SCIENCE GLOSSARY \n",
      " \n",
      "Abiotic:  A nonliving factor or element (e.g., light, water, heat, rock, energy, mineral). \n",
      " \n",
      "Acid deposition: Precipitation with a pH less than 5.6 that forms in the atmosphere when certain pollutants mix \n",
      "with water vapor. \n",
      " \n",
      "Allele:  Any of a set of possible forms of a gene. \n",
      " \n",
      "Biochemical conversion:  The changing of organic matter into other chemical forms. \n",
      " \n",
      "Biological diversity: The variety and complexity of species present and interacting in an ecosystem and the relative \n",
      "abundance of each. \n",
      " \n",
      "Biomass conversion: The changing of organic matter that has been produced by photosynthesis into useful liquid, gas \n",
      "or fuel.' metadata={'source': 'Science_Glossary.pdf', 'page': 0}\n"
     ]
    }
   ],
   "source": [
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=70)\n",
    "final_doc = text_splitter.split_documents(documents)\n",
    "print(len(final_doc))\n",
    "print(final_doc[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34ffcd9c-ce11-4e48-b382-ba3856e112fb",
   "metadata": {
    "id": "34ffcd9c-ce11-4e48-b382-ba3856e112fb"
   },
   "source": [
    "### Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "f8934f57-45d0-4e50-89ad-0cb2fc0d5b25",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 297,
     "referenced_widgets": [
      "ff72cdd042ec4eb1b93d5ed713e32c6a",
      "3d53a7815774475cab8ba768a0824f7a",
      "f10d1d61d05c4422b4f0f5b28b871c10",
      "6eb830cdb2ee4404be317e7df439407a",
      "4274ee81e1d44506ae83a2be7624680a",
      "535bab4fadec4e659e9390c661c7f2d8",
      "40149b5869f34953a893e82d8793b3c9",
      "9dd6a518c8f14156b0a1624a4637834e",
      "bd13c80455124051aad2bb4684652a54",
      "dae0fd5714ed41cab9064fbb29d2a7e4",
      "f7ebb41c397c485fbb37e8df8e148c5c",
      "3686d651580e461c9794a821fa08e2da",
      "ff70fea295a346a2bf04a8ca0a128aef",
      "a5c85deb46874a0a9ff676f25b33637b",
      "f6dc96eeeee24393a274c0ff440f8e45",
      "45854fd98a45437eb03f5186a0299ad4",
      "b08e55a776ab4c21b5aa90bab48e99c6",
      "ff253bd67cfe468f8f3fcf610516e1bf",
      "92d568b2c0b24256811bdd2193705c66",
      "39b03446d87144b19ee944dfc6df5348",
      "8d521371fcb14c65b13914be40a2e3f8",
      "6ca7b51b49fd47b49d8b92f98ade1018",
      "74b06fd143a547ce802b2544438b2e8f",
      "a16b2709e87246f3b816b9236a4d8ee8",
      "3f6b95648c3f4aecad7f2a8a265f4ea5",
      "d87411f7c8304fef83ed757ebbddadcc",
      "7b4824dc245540189ef5787e9c9d0e94",
      "7b96b193a965451fa65c94198f29fa5d",
      "2658ef18c20c4f8f88adbbb71b69795b",
      "a44c10da347f4c909171454e9d884f3f",
      "5ea6e489399a4d67915ee87a45b52aeb",
      "e811b25764d24b9e9d0abbfe85f19eb3",
      "3eadc2b8faba4673a89da29004955fb6"
     ]
    },
    "id": "f8934f57-45d0-4e50-89ad-0cb2fc0d5b25",
    "outputId": "f989cd89-a072-4b5d-c943-c109a91cea7c"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-16-486ef1baa5be>:1: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-huggingface package and should be used instead. To use it run `pip install -U :class:`~langchain-huggingface` and import as `from :class:`~langchain_huggingface import HuggingFaceEmbeddings``.\n",
      "  embeddings = SentenceTransformerEmbeddings(\n",
      "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n",
      "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
      "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
      "You will be able to reuse this secret in all of your notebooks.\n",
      "Please note that authentication is recommended but still optional to access public models or datasets.\n",
      "  warnings.warn(\n",
      "WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name llmware/industry-bert-insurance-v0.1. Creating a new one with mean pooling.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ff72cdd042ec4eb1b93d5ed713e32c6a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/808 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3686d651580e461c9794a821fa08e2da",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "74b06fd143a547ce802b2544438b2e8f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "embeddings = SentenceTransformerEmbeddings(\n",
    "    model_name=\"llmware/industry-bert-insurance-v0.1\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b07db212-a314-4f00-a6b1-37fb1c6457cd",
   "metadata": {
    "id": "b07db212-a314-4f00-a6b1-37fb1c6457cd"
   },
   "source": [
    "### Load the LLM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "387d5d94-b66d-455b-9bb1-f62a9aaf1a7a",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "387d5d94-b66d-455b-9bb1-f62a9aaf1a7a",
    "outputId": "63c6355a-2487-4ae3-91c9-444a9bc3d8f3"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-17-2fb2a66d2a61>:2: LangChainDeprecationWarning: The class `HuggingFaceHub` was deprecated in LangChain 0.0.21 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-huggingface package and should be used instead. To use it run `pip install -U :class:`~langchain-huggingface` and import as `from :class:`~langchain_huggingface import HuggingFaceEndpoint``.\n",
      "  llm = HuggingFaceHub(\n"
     ]
    }
   ],
   "source": [
    "repo_id = \"llmware/bling-sheared-llama-1.3b-0.1\"\n",
    "llm = HuggingFaceHub(\n",
    "    repo_id=repo_id, model_kwargs={\"temperature\": 0.3, \"max_length\": 500}\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "92e8431c-03e0-4a22-b934-2f3a076f25b2",
   "metadata": {
    "id": "92e8431c-03e0-4a22-b934-2f3a076f25b2"
   },
   "outputs": [],
   "source": [
    "def pretty_print_docs(docs):\n",
    "    print(\n",
    "        f\"\\n{'-'* 100}\\n\".join(\n",
    "            [f\"Document{i+1}:\\n\\n\" + d.page_content for i, d in enumerate(docs)]\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5f17af2-5b32-4e22-971b-eaf7e09a04a0",
   "metadata": {
    "id": "d5f17af2-5b32-4e22-971b-eaf7e09a04a0"
   },
   "source": [
    "### Instantiate VectorStore (LanceDB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "09ce605a-93ab-4734-8784-3b57dd330722",
   "metadata": {
    "id": "09ce605a-93ab-4734-8784-3b57dd330722"
   },
   "outputs": [],
   "source": [
    "import lancedb\n",
    "\n",
    "context_data = lancedb.connect(\"./.lancedb\")\n",
    "table = context_data.create_table(\n",
    "    \"context\",\n",
    "    data=[\n",
    "        {\n",
    "            \"vector\": embeddings.embed_query(\"Hello World\"),\n",
    "            \"text\": \"Hello World\",\n",
    "            \"id\": \"1\",\n",
    "        }\n",
    "    ],\n",
    "    mode=\"overwrite\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0ea8f72-5f32-4c79-a76a-123cae04041e",
   "metadata": {
    "id": "b0ea8f72-5f32-4c79-a76a-123cae04041e"
   },
   "source": [
    "### Retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "9f600c90-86b1-4f56-9741-372f8c1e3146",
   "metadata": {
    "id": "9f600c90-86b1-4f56-9741-372f8c1e3146"
   },
   "outputs": [],
   "source": [
    "# initialize the retriever\n",
    "from langchain_community.vectorstores import LanceDB\n",
    "\n",
    "database = LanceDB.from_documents(final_doc, embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "24279bd0-0bc6-4a80-b297-0ec2f8717642",
   "metadata": {
    "id": "24279bd0-0bc6-4a80-b297-0ec2f8717642"
   },
   "outputs": [],
   "source": [
    "retriever_d = database.as_retriever(search_kwargs={\"k\": 3})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "3f3b0406-f2ff-4ca4-bfaa-6063d3cfd249",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "3f3b0406-f2ff-4ca4-bfaa-6063d3cfd249",
    "outputId": "8c17963a-6245-479b-b938-bd46cec6b4b4"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-22-e0bd1dd2283d>:1: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 1.0. Use :meth:`~invoke` instead.\n",
      "  docs = retriever_d.get_relevant_documents(query=\"What is Wetlands?\")\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document1:\n",
      "\n",
      "body of water; also called a drainage basin. \n",
      "  \n",
      "Wetlands: Lands where water saturation is the dominant factor determining the nature of the soil \n",
      "development and the plant and animal communities (e.g., sloughs, estuaries, marshes). \n",
      " \n",
      " \n",
      " 7\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document2:\n",
      "\n",
      "developed state. \n",
      " \n",
      "Endangered species:  A species that is in danger of extinction throughout all or a significant portion of its range. \n",
      " \n",
      "Engineering: The application of scientific, physical, mechanical and mathematical principles to design \n",
      "processes, products and structures that improve the quality of life. \n",
      " \n",
      "Environment: The total of the surroundings (air, water, soil, vegetation, people, wildlife) influencing each living \n",
      "being’s existence, including physical, biological and all other factors; the surroundings of a plant \n",
      "or animals including other plants or animals, climate and location. \n",
      " 2\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document3:\n",
      "\n",
      "Niche (ecological): The role played by an organism in an ecosystem; its food preferences, requirements for shelter, \n",
      "special behaviors and the timing of its activities (e.g., nocturnal, diurnal), interaction with other \n",
      "organisms and its habitat. \n",
      " \n",
      "Nonpoint source pollution: Contamination that originates from many locations that all discharge into a location (e.g., a lake, \n",
      "stream, land area). \n",
      " \n",
      "Nonrenewable resources: Substances (e.g., oil, gas, coal, copper, gold) that, once used, cannot be replaced in this geological \n",
      "age. \n",
      " \n",
      "Nova: A variable star that suddenly increases in brightness to several times its normal magnitude and\n"
     ]
    }
   ],
   "source": [
    "docs = retriever_d.get_relevant_documents(query=\"What is Wetlands?\")\n",
    "pretty_print_docs(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e6dd9e2-26c5-4db5-b900-69455da86a55",
   "metadata": {
    "id": "3e6dd9e2-26c5-4db5-b900-69455da86a55"
   },
   "source": [
    "### Compressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "fe7799d1-5f00-4b99-8fc2-f1054cbb9c23",
   "metadata": {
    "id": "fe7799d1-5f00-4b99-8fc2-f1054cbb9c23"
   },
   "outputs": [],
   "source": [
    "# creating the compressor\n",
    "compressor = LLMChainExtractor.from_llm(llm=llm)\n",
    "\n",
    "# compressor retriver = base retriever + compressor\n",
    "compression_retriever = ContextualCompressionRetriever(\n",
    "    base_retriever=retriever_d, base_compressor=compressor\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "287004b8-9d32-4c44-a038-889d826e5799",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "287004b8-9d32-4c44-a038-889d826e5799",
    "outputId": "7bcca812-f3eb-4c45-886d-09b24623dbaa"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "··········\n",
      "Document1:\n",
      "\n",
      "Niche (ecological): The role played by an organism in an ecosystem; its food preferences, requirements for shelter, \n",
      "special behaviors and the timing of its activities (e.g., nocturnal, diurnal), interaction with other \n",
      "organisms and its habitat. \n",
      " \n",
      "Nonpoint source pollution: Contamination that originates from many locations that all discharge into a location (e.g., a lake, \n",
      "stream, land area). \n",
      " \n",
      "Nonrenewable resources: Substances (e.g., oil, gas, coal, copper, gold) that, once used, cannot be replaced in this geological \n",
      "age. \n",
      " \n",
      "Nova: A variable star that suddenly increases in brightness to several times its normal magnitude and\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document2:\n",
      "\n",
      "developed state. \n",
      " \n",
      "Endangered species:  A species that is in danger of extinction throughout all or a significant portion of its range. \n",
      " \n",
      "Engineering: The application of scientific, physical, mechanical and mathematical principles to design \n",
      "processes, products and structures that improve the quality of life. \n",
      " \n",
      "Environment: The total of the surroundings (air, water, soil, vegetation, people, wildlife) influencing each living \n",
      "being’s existence, including physical, biological and all other factors; the surroundings of a plant \n",
      "or animals including other plants or animals, climate and location. \n",
      " 2\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document3:\n",
      "\n",
      "and age relationships of rock units and the occurrences of structural features, mineral deposits \n",
      "and fossil localities). \n",
      " \n",
      "Groundwater:  Water that infiltrates the soil and is located in underground reservoirs called aquifers. \n",
      " \n",
      "Hazardous waste: A solid that, because of its quantity or concentration or its physical, chemical or infectious \n",
      "characteristics, may cause or pose a substantial present or potential hazard to human health or \n",
      "the environment when improperly treated, stored, transported or disposed of, or otherwise \n",
      "managed. \n",
      " \n",
      "Homeostasis:  The tendency for a system to remain in a state of equilibrium by resisting change. \n",
      " \n",
      " 3\n"
     ]
    }
   ],
   "source": [
    "os.environ[\"OPENAI_API_KEY \"] = getpass()\n",
    "embdeddings_filter = EmbeddingsFilter(embeddings=embeddings)\n",
    "compression_retriever_filter = ContextualCompressionRetriever(\n",
    "    base_retriever=retriever_d, base_compressor=embdeddings_filter\n",
    ")\n",
    "\n",
    "compressed_docs = compression_retriever_filter.get_relevant_documents(\n",
    "    query=\"What is the Environment?\"\n",
    ")\n",
    "pretty_print_docs(compressed_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fba3d435-4ab7-4792-85df-d39a1a5ea8ae",
   "metadata": {
    "id": "fba3d435-4ab7-4792-85df-d39a1a5ea8ae"
   },
   "source": [
    "### Retrieve answer from Compressed Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "d010a946-e53a-467c-aac6-69c454d48d13",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "d010a946-e53a-467c-aac6-69c454d48d13",
    "outputId": "b8eecae8-4e1f-4d24-de9f-bc53cc6b8ce4"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-26-a460cd4d674d>:7: LangChainDeprecationWarning: The method `Chain.__call__` was deprecated in langchain 0.1.0 and will be removed in 1.0. Use :meth:`~invoke` instead.\n",
      "  qa(\"What is Environment?\")\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\u001b[1m> Entering new RetrievalQA chain...\u001b[0m\n",
      "\n",
      "\u001b[1m> Finished chain.\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'query': 'What is Environment?',\n",
       " 'result': \"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\\n\\nNiche (ecological): The role played by an organism in an ecosystem; its food preferences, requirements for shelter, \\nspecial behaviors and the timing of its activities (e.g., nocturnal, diurnal), interaction with other \\norganisms and its habitat. \\n \\nNonpoint source pollution: Contamination that originates from many locations that all discharge into a location (e.g., a lake, \\nstream, land area). \\n \\nNonrenewable resources: Substances (e.g., oil, gas, coal, copper, gold) that, once used, cannot be replaced in this geological \\nage. \\n \\nNova: A variable star that suddenly increases in brightness to several times its normal magnitude and\\n\\ndeveloped state. \\n \\nEndangered species:  A species that is in danger of extinction throughout all or a significant portion of its range. \\n \\nEngineering: The application of scientific, physical, mechanical and mathematical principles to design \\nprocesses, products and structures that improve the quality of life. \\n \\nEnvironment: The total of the surroundings (air, water, soil, vegetation, people, wildlife) influencing each living \\nbeing’s existence, including physical, biological and all other factors; the surroundings of a plant \\nor animals including other plants or animals, climate and location. \\n 2\\n\\nand age relationships of rock units and the occurrences of structural features, mineral deposits \\nand fossil localities). \\n \\nGroundwater:  Water that infiltrates the soil and is located in underground reservoirs called aquifers. \\n \\nHazardous waste: A solid that, because of its quantity or concentration or its physical, chemical or infectious \\ncharacteristics, may cause or pose a substantial present or potential hazard to human health or \\nthe environment when improperly treated, stored, transported or disposed of, or otherwise \\nmanaged. \\n \\nHomeostasis:  The tendency for a system to remain in a state of equilibrium by resisting change. \\n \\n 3\\n\\nQuestion: What is Environment?\\nHelpful Answer: Environment is the total of the surroundings (air, water, soil, vegetation, people, wildlife) influencing each living being’s existence, including physical, biological and all other factors; the surroundings of a plant or animals including other plants or animals, climate and location.\\n\\nQuestion: What is a list of the five niche categories?\\nHelpful Answer: Niche categories are: \\nNiche (ecological): The role played by an\"}"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "qa = RetrievalQA.from_chain_type(\n",
    "    llm=llm, chain_type=\"stuff\", retriever=compression_retriever_filter, verbose=True\n",
    ")\n",
    "# Ask Question\n",
    "qa(\"What is Environment?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2052e24a-1f50-408d-bf10-df1c2a4b5af6",
   "metadata": {
    "id": "2052e24a-1f50-408d-bf10-df1c2a4b5af6"
   },
   "source": [
    "# Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "cb8d3941-fd91-47ad-80ef-863df0242b38",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "cb8d3941-fd91-47ad-80ef-863df0242b38",
    "outputId": "7ccb2d95-a758-4737-be4a-059d64893343"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "base_compressor=DocumentCompressorPipeline(transformers=[EmbeddingsRedundantFilter(embeddings=HuggingFaceEmbeddings(client=SentenceTransformer(\n",
      "  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel \n",
      "  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})\n",
      "), model_name='llmware/industry-bert-insurance-v0.1', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False), similarity_fn=<function cosine_similarity at 0x7a582bfddab0>, similarity_threshold=0.95), EmbeddingsFilter(embeddings=HuggingFaceEmbeddings(client=SentenceTransformer(\n",
      "  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel \n",
      "  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})\n",
      "), model_name='llmware/industry-bert-insurance-v0.1', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False), similarity_fn=<function cosine_similarity at 0x7a582bfddab0>, k=5, similarity_threshold=None)]) base_retriever=VectorStoreRetriever(tags=['LanceDB', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.lancedb.LanceDB object at 0x7a5736a97e20>, search_kwargs={'k': 3})\n",
      "Document1:\n",
      "\n",
      "Niche (ecological): The role played by an organism in an ecosystem; its food preferences, requirements for shelter, \n",
      "special behaviors and the timing of its activities (e.g., nocturnal, diurnal), interaction with other \n",
      "organisms and its habitat. \n",
      " \n",
      "Nonpoint source pollution: Contamination that originates from many locations that all discharge into a location (e.g., a lake, \n",
      "stream, land area). \n",
      " \n",
      "Nonrenewable resources: Substances (e.g., oil, gas, coal, copper, gold) that, once used, cannot be replaced in this geological \n",
      "age. \n",
      " \n",
      "Nova: A variable star that suddenly increases in brightness to several times its normal magnitude and\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document2:\n",
      "\n",
      "developed state. \n",
      " \n",
      "Endangered species:  A species that is in danger of extinction throughout all or a significant portion of its range. \n",
      " \n",
      "Engineering: The application of scientific, physical, mechanical and mathematical principles to design \n",
      "processes, products and structures that improve the quality of life. \n",
      " \n",
      "Environment: The total of the surroundings (air, water, soil, vegetation, people, wildlife) influencing each living \n",
      "being’s existence, including physical, biological and all other factors; the surroundings of a plant \n",
      "or animals including other plants or animals, climate and location. \n",
      " 2\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document3:\n",
      "\n",
      "and age relationships of rock units and the occurrences of structural features, mineral deposits \n",
      "and fossil localities). \n",
      " \n",
      "Groundwater:  Water that infiltrates the soil and is located in underground reservoirs called aquifers. \n",
      " \n",
      "Hazardous waste: A solid that, because of its quantity or concentration or its physical, chemical or infectious \n",
      "characteristics, may cause or pose a substantial present or potential hazard to human health or \n",
      "the environment when improperly treated, stored, transported or disposed of, or otherwise \n",
      "managed. \n",
      " \n",
      "Homeostasis:  The tendency for a system to remain in a state of equilibrium by resisting change. \n",
      " \n",
      " 3\n"
     ]
    }
   ],
   "source": [
    "redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)\n",
    "relevant_filter = EmbeddingsFilter(embeddings=embeddings, k=5)\n",
    "\n",
    "# creating the pipeline\n",
    "compressed_pipeline = DocumentCompressorPipeline(\n",
    "    transformers=[redundant_filter, relevant_filter]\n",
    ")\n",
    "\n",
    "# compressor retriever\n",
    "comp_pipe_retrieve = ContextualCompressionRetriever(\n",
    "    base_retriever=retriever_d, base_compressor=compressed_pipeline\n",
    ")\n",
    "\n",
    "# print the prompt\n",
    "print(comp_pipe_retrieve)\n",
    "\n",
    "# Get relevant documents\n",
    "compressed_docs = comp_pipe_retrieve.get_relevant_documents(\n",
    "    query=\"What is Environment?\"\n",
    ")\n",
    "pretty_print_docs(compressed_docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "9afff673-d096-4bd5-b4fb-867cec1c9bda",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "9afff673-d096-4bd5-b4fb-867cec1c9bda",
    "outputId": "f18485c7-cdc4-4d3f-c7b3-c0d5614c8210"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document1:\n",
      "\n",
      "and age relationships of rock units and the occurrences of structural features, mineral deposits \n",
      "and fossil localities). \n",
      " \n",
      "Groundwater:  Water that infiltrates the soil and is located in underground reservoirs called aquifers. \n",
      " \n",
      "Hazardous waste: A solid that, because of its quantity or concentration or its physical, chemical or infectious \n",
      "characteristics, may cause or pose a substantial present or potential hazard to human health or \n",
      "the environment when improperly treated, stored, transported or disposed of, or otherwise \n",
      "managed. \n",
      " \n",
      "Homeostasis:  The tendency for a system to remain in a state of equilibrium by resisting change. \n",
      " \n",
      " 3\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document2:\n",
      "\n",
      "Transportation systems: A group of related parts that function together to perform a major task in any form of \n",
      "transportation. \n",
      " \n",
      "Transportation  \n",
      "technology:  The physical ways humans move materials, goods and people. \n",
      " \n",
      "Trophic levels: The role of an organism in nutrient and energy flow within an ecosystem (e.g., herbivore, \n",
      "carnivore, decomposer). \n",
      " \n",
      "Waste Stream:  The flow of (waste) materials from generation, collection and separation to disposal. \n",
      " \n",
      "Watershed: The land area from which surface runoff drains into a stream, channel, lake, reservoir or other \n",
      "body of water; also called a drainage basin.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document3:\n",
      "\n",
      "another atom but a different number of neutrons. \n",
      " \n",
      "Recycling:  Collecting and reprocessing a resource or product to make into new products. \n",
      " \n",
      "Regulation: A rule or order issued by an executive authority or regulatory agency of a government and having \n",
      "the force of law. \n",
      " \n",
      "Renewable: A naturally occurring raw material or form of energy that will be replenished through natural \n",
      "ecological cycles or sound management practices (e.g., the sun, wind, water, trees). \n",
      " \n",
      "Risk management: A strategy developed to reduce or control the chance of harm or loss to one’s health or life; the \n",
      "process of identifying, evaluating, selecting and implementing actions to reduce risk to human\n"
     ]
    }
   ],
   "source": [
    "compressed_docs = comp_pipe_retrieve.get_relevant_documents(\n",
    "    query=\"What is Hazardous waste?\"\n",
    ")\n",
    "pretty_print_docs(compressed_docs)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
